
Carleton University's Institutional Repository

On finding minimal absent words

Author: Armando J Pinho
C Acquisti
D Gusfield
DK Kim
DK Kim
E Ukkonen
EM McCreight
F Shi
G Hampikian
J Herold
J Kärkkäinen
João MOS Rodrigues
M Burrows
MI Abouelhoda
MI Abouelhoda
P Weiner
Paulo JSG Ferreira
S Kurtz
Sara P Garcia
T Kasai
U Manber
U Manber
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: The problem of finding the shortest absent words in DNA data has been recently addressed, and algorithms for its solution have been described. It has been noted that longer absent words might also be of interest, but the existing algorithms only provide generic absent words by trivially extending the shortest ones. Results: We show how absent words relate to the repetitions and structure of the data, and define a new and larger class of absent words, called minimal absent words, that still captures the essential properties of the shortest absent words introduced in recent works. The words of this new class are minimal in the sense that if their leftmost or rightmost character is removed, then the resulting word is no longer an absent word. We describe an algorithm for generating minimal absent words that, in practice, runs in approximately linear time. An implementation of this algorithm is publicly available at ftp://www.ieeta.pt/~ap/maws. Conclusion: Because the set of minimal absent words that we propose is much larger than the set of the shortest absent words, it is potentially more useful for applications that require a richer variety of absent words. Nevertheless, the number of minimal absent words is still manageable since it grows at most linearly with the string size, unlike generic absent words that grow exponentially. Both the algorithm and the concepts upon which it depends shed additional light on the structure of absent words and complement the existing studies on the topic.
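The definition of a minimal absent word given in this abstract can be checked on small inputs by brute force. The sketch below is only an illustration of the definition, not the approximately linear-time algorithm the abstract describes; the function name `minimal_absent_words` and the `alphabet` parameter are assumptions made for this example.

```python
def minimal_absent_words(s, alphabet="ACGT"):
    """Brute-force enumeration of minimal absent words of s (illustrative only).

    A word w is reported when w does not occur in s, but both w without its
    first character and w without its last character do occur in s.
    """
    # All substrings of s, plus the empty word (which always "occurs").
    subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    subs.add("")

    maws = set()
    # Length-1 case: a letter that never appears in s.
    for a in alphabet:
        if a not in subs:
            maws.add(a)
    # Length >= 2 case: w = a + u + b with a+u and u+b present, w absent.
    for u in subs:
        for a in alphabet:
            if a + u not in subs:
                continue
            for b in alphabet:
                if u + b in subs and a + u + b not in subs:
                    maws.add(a + u + b)
    return sorted(maws, key=lambda w: (len(w), w))


print(minimal_absent_words("ACAC", alphabet="AC"))  # ['AA', 'CC', 'CACA']
```

This version stores every substring, so it needs quadratic space and is suitable only for short strings.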

A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences

Author: A Lempel
A Puglisi
Andrew K Benson
CG Nevill-Manning
David J Russell
DR Bastola
E Ukkonen
EK Costello
EM McCreight
HH Otu
J Ziv
J Ziv
JD Parsons
JD Thompson
Khalid Sayood
L Holm
M Charikar
M Halkidi
P Weiner
RC Edgar
Samuel F Way
SF Altschul
W Li
W Li
W Li
WJ Wilbur
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine partitioning for a set of biological sequences. The algorithm performs clustering in which new sequences are compared with cluster-representative sequences to determine membership. If comparison fails to identify a suitable cluster, a new cluster is created. Results: The performance of the proposed algorithm is validated via comparison to the popular DNA/RNA sequence clustering approach, CD-HIT-EST, and to the recently developed algorithm, UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The proposed algorithm has a CPU execution time comparable to that of CD-HIT-EST, which is much slower than UCLUST, and has successfully generated clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets. Conclusions: We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.
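The representative-based clustering loop described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the grammar-based distance is not specified here, so a k-mer Jaccard distance (`kmer_distance`) is swapped in as a stand-in, and the names `greedy_cluster` and `threshold` are hypothetical.

```python
def kmer_set(seq, k=3):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """Stand-in distance (NOT the grammar-based metric): 1 - Jaccard index."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka and not kb:
        return 0.0
    return 1.0 - len(ka & kb) / len(ka | kb)


def greedy_cluster(seqs, threshold=0.5, distance=kmer_distance):
    """Assign each sequence to the closest representative within threshold.

    The first member of each cluster acts as its representative. If no
    representative is close enough, the sequence starts a new cluster.
    """
    clusters = []  # each cluster is a list; clusters[i][0] is the representative
    for s in seqs:
        best_idx, best_d = None, threshold
        for idx, members in enumerate(clusters):
            d = distance(s, members[0])
            if d <= best_d:
                best_idx, best_d = idx, d
        if best_idx is None:
            clusters.append([s])
        else:
            clusters[best_idx].append(s)
    return clusters


seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]
print(greedy_cluster(seqs))
```

Because each new sequence is compared only with representatives, the number of distance computations depends on the number of clusters rather than on the number of sequences already seen.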

DigitalCommons@University of Nebraska

An efficient algorithm for systematic analysis of nucleotide strings suitable for siRNA design

Author: A Apostolico
A Verdel
AC Hsieh
AL Jackson
AL Jackson
AM Chalk
Ancha Baranova
CF Hung
E Ukkonen
EM McCreight
F Fernandes
F Tilesi
Ganiraju Manyam
IT Li
J Na
Jonathan Bode
K Ui-Tei
M Scherr
Maria Emelianenko
MH Schulz
P Saetrom
P Svoboda
P Weiner
PB Hajeri
PC Scacheri
R Giegerich
SA Manavski
T Alsheddi
W Cui
X Dai
Y Naito
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: The "off-target" silencing effect hinders the development of siRNA-based therapeutic and research applications. Existing solutions for finding possible locations of siRNA seats within a large database of genes are either too slow, miss a portion of the targets, or are simply not designed to handle a very large number of queries. We propose a new approach that reduces the computational time as compared to existing techniques. Findings: The proposed method employs tree-based storage in the form of a modified truncated suffix tree to sort all possible short substrings within a given set of strings (i.e. a transcriptome). Using the new algorithm, we pre-computed a list of the best siRNA locations within each human gene ("siRNA seats"). siRNAs designed to reside within siRNA seats are less likely to hybridize off-target. These siRNA seats could be used as an input for the traditional "set-of-rules" type of siRNA design software. The list of siRNA seats is available through a publicly available database located at http://web.cos.gmu.edu/~gmanyam/siRNA_db/search.php. Conclusions: In an attempt to perform top-down prediction of human siRNAs with minimized off-target hybridization, we developed an efficient algorithm that employs suffix-tree-based storage of the substrings. Applications of this approach are not limited to optimal siRNA design, but can also be useful for other tasks involving selection of the characteristic strings specific to individual genes. These strings could then be used as siRNA seats, as specific probes for gene expression studies by oligonucleotide-based microarrays, for the design of molecular beacon probes for Real-Time PCR and, generally, any type of PCR primers.
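The idea of selecting short substrings that are specific to a single gene can be illustrated with a small index. The sketch below is an assumption-laden stand-in: it uses a plain dictionary keyed by k-mer rather than the modified truncated suffix tree the abstract refers to, and the function name `gene_specific_kmers` and the toy gene identifiers are hypothetical.

```python
from collections import defaultdict


def gene_specific_kmers(genes, k):
    """Return, per gene, the length-k substrings that occur in no other gene.

    genes: dict mapping a gene identifier to its sequence.
    A dictionary is used here for brevity; it is not the truncated suffix
    tree from the abstract, only a simple substitute with the same output.
    """
    owners = defaultdict(set)  # k-mer -> set of gene ids containing it
    for gene_id, seq in genes.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(gene_id)

    specific = defaultdict(list)
    for kmer, ids in owners.items():
        if len(ids) == 1:  # present in exactly one gene
            specific[next(iter(ids))].append(kmer)
    return {gene_id: sorted(kmers) for gene_id, kmers in specific.items()}


toy = {"g1": "ACGTAC", "g2": "CGTACC"}
print(gene_specific_kmers(toy, k=4))  # {'g1': ['ACGT'], 'g2': ['TACC']}
```

This exact-match version ignores near matches; off-target prediction in practice would also have to consider mismatches.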

The design and implementation of service process reconfiguration with end-to-end QoS constraints in SOA

Author: Bin Xu
D Ardagna
EM McCreight
J Vanhatalo
Jing Zhang
K Lin
Kwei-Jay Lin
L Zeng
L Zeng
M Bichler
MP Papazoglou
R Rouvoy
T Yu
Y Zhang
Yanlong Zhai
Publication venue: 'Springer Science and Business Media LLC'

A basic analysis toolkit for biological sequences

Author: A Aho
A Amir
A Apostolico
A Czumaj
Alessandro Siragusa
B Schieber
BS Baker
D Eppstein
D Eppstein
D Gusfield
D Hirschberg
E Ukkonen
E Ukkonen
EM McCreight
Enrico Siragusa
EW Myers
Filippo Utro
G Landau
G Landau
J Hunt
K Mehlhorn
M Dayhoff
M Leung
M Waterman
M Waterman
MM Klawe
O Gotoh
R Cole
R Giancarlo
Raffaele Giancarlo
S Altshul
S Henikoff
S Henikoff
S Sinha
S Sinha
W Fitch
W Goad
W Miller
Z Galil
Z Galil
Publication venue: BioMed Central
Publication date: 01/01/2007

This paper presents a software library, nicknamed BATS, for some basic sequence analysis tasks: local alignments, via approximate string matching, and global alignments, via longest common subsequence and alignments with affine and concave gap cost functions. It also supports filtering operations to select strings from a set and establish their statistical significance, via z-score computation. None of the algorithms is new, but although they are generally regarded as fundamental for sequence analysis, they have not been implemented in a single and consistent software package, as we do here. Therefore, our main contribution is to fill this gap between algorithmic theory and practice by providing an extensible and easy-to-use software library that includes algorithms for the mentioned string matching and alignment problems. The library consists of C/C++ library functions as well as Perl library functions. It can be interfaced with Bioperl and can also be used as a stand-alone system with a GUI. The software is available at under the GNU GPL
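Two of the tasks named in this abstract, longest common subsequence and z-score computation, can be sketched in a few lines. The code below is a textbook illustration only and does not reproduce the BATS API; the function names `lcs_length` and `z_score` are assumptions for this example.

```python
from statistics import mean, stdev


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b.

    Standard dynamic programming, keeping only the previous row of the table.
    """
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def z_score(observed, background):
    """Standard score of an observed value against background samples.

    Requires at least two background values with non-zero sample deviation.
    """
    return (observed - mean(background)) / stdev(background)


print(lcs_length("AGGTAB", "GXTXAYB"))  # 4, e.g. "GTAB"
print(z_score(5, [1, 2, 3]))            # 3.0
```

The row-by-row formulation uses memory proportional to the shorter dimension of the table, at the cost of not recovering the subsequence itself.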

Archivio istituzionale della ricerca - Università di Palermo

Accelerating Sequence Searching: Dimensionality Reduction Method

Author: Baihua Zheng
Bin Cui
D Gusfield
DA Benson
Dongqing Yang
E Keogh
E Keogh
EM McCreight
G Navarro
G Navarro
GH Golub
Guojie Song
IT Jolliffe
Kunqing Xie
M Li
S Kadiyala
TH Cormen
W Pearson
Yu DanTong
Z Zhang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

Similarity search over long sequence datasets is becoming increasingly popular in many emerging applications, such as text retrieval and genetic sequence exploration. In this paper, a novel index structure, the Sequence Embedding Multiset tree (SEM-tree), is proposed to speed up the searching process over long sequences. The SEM-tree is a multi-level structure where each level represents the sequence data with a different compression level of multiset, and the length of the multiset increases towards the leaf level, which contains the original sequences. The multisets, obtained using sequence embedding algorithms, have the desirable property that they do not need to keep the character order in the sequence, i.e. they give a shorter representation, but can preserve the majority of the distance information of the sequences. Each level of the tree serves to prune the search space more efficiently, as the multisets utilize the predictability to finish the searching process beforehand and reduce the computational cost greatly. A set of comprehensive experiments is conducted to evaluate the performance of the SEM-tree, and the experimental results show that the proposed method is much more efficient than existing representative methods.

Institutional Knowledge at Singapore Management University

Statistical significance of cis-regulatory modules

Author: A Kel
A Klingenhoff
A Sandelin
A Sosinsky
A Wagner
A Wagner
A Wagner
A Webber
AA Philippakis
Andrew D Smith
AP Lifanov
BP Berman
BP Berman
D GuhaThakurta
DS Johnson
Dustin E Schones
E Eskin
EM McCreight
F Tronche
G Hertz
GD Stormo
J van Helden
JM Claverie
JM Claverie
JS Liu
K Struhl
M Beckstette
M Beckstette
M Blanchette
M Gupta
MA Beer
MC Frith
MC Frith
Michael Q Zhang
N Munshi
N Nagarajan
N Rajewsky
O Johansson
P Leighton
Q Zhou
R Hoberman
R Hoberman
R Staden
RR Sokal
S Aerts
S Rahmann
S Sinha
TD Schneider
TL Bailey
TL Bailey
TL Baily
V Matys
W Kent
W Thompson
WB Alkema
WW Wasserman
YH Grad
Z Xuan
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: It is becoming increasingly important for researchers to be able to scan through large genomic regions for transcription factor binding sites or clusters of binding sites forming cis-regulatory modules. Correspondingly, there has been a push to develop algorithms for the rapid detection and assessment of cis-regulatory modules. While various algorithms for this purpose have been introduced, most are not well suited for rapid, genome scale scanning. RESULTS: We introduce methods designed for the detection and statistical evaluation of cis-regulatory modules, modeled as either clusters of individual binding sites or as combinations of sites with constrained organization. In order to determine the statistical significance of module sites, we first need a method to determine the statistical significance of single transcription factor binding site matches. We introduce a straightforward method of estimating the statistical significance of single site matches using a database of known promoters to produce data structures that can be used to estimate p-values for binding site matches. We next introduce a technique to calculate the statistical significance of the arrangement of binding sites within a module using a max-gap model. If the module scanned for has defined organizational parameters, the probability of the module is corrected to account for organizational constraints. The statistical significance of single site matches and the architecture of sites within the module can be combined to provide an overall estimation of statistical significance of cis-regulatory module sites. CONCLUSION: The methods introduced in this paper allow for the detection and statistical evaluation of single transcription factor binding sites and cis-regulatory modules. The features described are implemented in the Search Tool for Occurrences of Regulatory Motifs (STORM) and MODSTORM software
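The max-gap arrangement mentioned in this abstract can be illustrated by grouping binding-site positions whose neighbours lie within a maximum gap. The sketch below only finds such clusters and does not compute their statistical significance; it is not the STORM or MODSTORM implementation, and the function name `max_gap_clusters`, the `min_sites` parameter, and the convention of measuring gaps between site start positions are assumptions.

```python
def max_gap_clusters(positions, max_gap, min_sites=2):
    """Group site positions into clusters where adjacent sites are close.

    Two consecutive sites (in sorted order) belong to the same cluster when
    their positions differ by at most max_gap. Clusters with fewer than
    min_sites sites are discarded.
    """
    clusters = []
    current = []
    for pos in sorted(positions):
        # A gap larger than max_gap closes the current cluster.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        clusters.append(current)
    return clusters


sites = [10, 25, 40, 200, 210, 500]
print(max_gap_clusters(sites, max_gap=20))  # [[10, 25, 40], [200, 210]]
```

After sorting, a single pass over the positions is enough, which suits scanning long genomic regions.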

Cold Spring Harbor Laboratory Institutional Repository

Metformin and the gastrointestinal tract

Author: A Ait-Omar
A Everard
A Napolitano
AJ Mulherin
AJ Scheen
AK Madiraju
BD Green
C Beysen
C Thomas
C Wilcock
CJ Bailey
CJ Bailey
CJ Bailey
CJ Bailey
CJ Bailey
Clifford J. Bailey
D Carter
D Carter
D Stepensky
E Gontier
E Mannucci
EM Migoya
Ewan R. Pearson
F Lien
F Yi
FA Duca
FH Karlsson
G Firneisz
GG Graham
GP Fadini
GT Tucker
H Koepsell
H Lee
H Tilg
I Vardarli
J Cuthbertson
J Davidson
J Müller
J Qin
J Walker
JD Lalau
JH Burton
JH Scarpello
JR Lindsay
JR Oh
K Engel
K Engel
K Forslund
K Zhou
K Zhou
KY Hur
L Penicaud
Laura J. McCreight
LX Cubeddu
M Zhou
MH Kim
MJ Zema
MM Christensen
MM Christensen
MM Christensen
MR Koehler
N Lee
N Yasuda
NR Shin
PH Marathe
PV Röder
RA Jackson
RI Misbin
RJ Naftalin
S Capitanio
S Chen
SA Hinke
SE Innzucchi
SI Inzucchi
SK Thondam
SS Gambhir
SW Yee
T Dujic
T Wu
TB Stage
TK Han
TK Han
TM Davis
UK Prospective Diabetes Study (UKPDS) Group
V Gorboulev
WF Caspary
WR Proctor
X Zhang
Y Sakar
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

Metformin is an effective agent with a good safety profile that is widely used as a first-line treatment for type 2 diabetes, yet its mechanisms of action and variability in terms of efficacy and side effects remain poorly understood. Although the liver is recognised as a major site of metformin pharmacodynamics, recent evidence also implicates the gut as an important site of action. Metformin has a number of actions within the gut. It increases intestinal glucose uptake and lactate production, increases GLP-1 concentrations and the bile acid pool within the intestine, and alters the microbiome. A novel delayed-release preparation of metformin has recently been shown to improve glycaemic control to a similar extent to immediate-release metformin, but with less systemic exposure. We believe that metformin response and tolerance are intrinsically linked with the gut. This review examines the passage of metformin through the gut, and how this can affect the efficacy of metformin treatment in the individual, and contribute to the side effects associated with metformin intolerance.